PARTIALLY OBSERVABLE RISK-SENSITIVE MARKOV DECISION PROCESSES

NICOLE BAUERLE* AND ULRICH RIEDER* 


Abstract. We consider the problem of minimizing a certainty equivalent of the total or
discounted cost over a finite and an infinite time horizon which is generated by a Partially
Observable Markov Decision Process (POMDP). The certainty equivalent is defined by
$U^{-1}(\mathbb{E}U(Y))$ where $U$ is an increasing function. In contrast to a risk-neutral
decision maker, this optimization criterion takes the variability of the cost into account. It
contains as a special case the classical risk-sensitive optimization criterion with an exponential
utility. We show that this optimization problem can be solved by embedding the problem into a
completely observable Markov Decision Process with extended state space and give conditions
under which an optimal policy exists. The state space has to be extended by the joint conditional
distribution of current unobserved state and accumulated cost. In case of an exponential utility,
the problem simplifies considerably and we rediscover what in previous literature has been named
information state. However, since we do not use any change of measure techniques here, our
approach is simpler. A simple example, namely a risk-sensitive Bayesian house selling problem, is
considered to illustrate our results.


Key WORDS: Partially Observable Markov Decision Problem, Certainty Equivalent, 
Exponential Utility, Updating Operator, Value Iteration. 


1. Introduction 


In this work we consider Partially Observable Markov Decision Processes (POMDP) under a
general risk-sensitive optimization criterion for problems with finite and infinite time horizon.
This is a continuation of our research published in [2]. More precisely, our aim is to minimize
the certainty equivalent of the accumulated total cost of a POMDP. In case of an infinite time
horizon, costs have to be discounted. The certainty equivalent of a random variable $X$ is defined
by $U^{-1}(\mathbb{E}U(X))$ where $U$ is an increasing function. If $U(x) = x$ we obtain as a
special case the classical risk-neutral decision maker. The case
$U(x) = \frac{1}{\gamma}e^{\gamma x}$ is often referred to as 'risk-sensitive'; however, the
risk sensitivity is here only expressed in a special way through the risk-sensitivity
parameter $\gamma \neq 0$. More generally, the certainty equivalent may be written (assuming enough
regularity of $U$) as

\[ U^{-1}\big(\mathbb{E}[U(X)]\big) \approx \mathbb{E}X - \tfrac{1}{2}\, l_U(\mathbb{E}X)\,\mathrm{Var}[X] \tag{1.1} \]

where

\[ l_U(x) = -\frac{U''(x)}{U'(x)} \]

is the Arrow-Pratt function of absolute risk aversion. In case of an exponential utility, this
absolute risk aversion is constant (for a discussion see [5]). If $U$ is concave, the variance is
subtracted and the decision maker is risk seeking in case cost is minimized; if $U$ is convex, then
the variance is added and the decision maker is risk averse.
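The approximation (1.1) can be checked numerically on a small example. The following is a minimal sketch, assuming a finite-valued cost and the exponential utility $U(x) = \frac{1}{\gamma}e^{\gamma x}$; the function names are illustrative and do not come from the paper.

```python
import math

def certainty_equivalent(values, probs, U, U_inv):
    """Return U^{-1}(E[U(X)]) for a finite-valued random variable X."""
    expected_utility = sum(p * U(v) for v, p in zip(values, probs))
    return U_inv(expected_utility)

def arrow_pratt_approximation(values, probs, risk_aversion):
    """Second-order approximation EX - 0.5 * l_U(EX) * Var[X] from (1.1)."""
    mean = sum(p * v for v, p in zip(values, probs))
    var = sum(p * (v - mean) ** 2 for v, p in zip(values, probs))
    return mean - 0.5 * risk_aversion(mean) * var

def exponential_utility(gamma):
    """U(x) = exp(gamma*x)/gamma, its inverse, and l_U(x) = -U''(x)/U'(x) = -gamma."""
    U = lambda x: math.exp(gamma * x) / gamma
    U_inv = lambda u: math.log(gamma * u) / gamma
    l_U = lambda x: -gamma
    return U, U_inv, l_U

if __name__ == "__main__":
    values, probs = [0.0, 1.0, 2.0], [0.25, 0.5, 0.25]
    for gamma in (0.1, -0.1):
        U, U_inv, l_U = exponential_utility(gamma)
        print(gamma,
              certainty_equivalent(values, probs, U, U_inv),
              arrow_pratt_approximation(values, probs, l_U))
```

For $\gamma > 0$ the utility is convex and the certainty equivalent of the cost lies above its mean; for $\gamma < 0$ it lies below, in line with the discussion of risk aversion above.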

In case of complete observation it has been shown in [2] that this problem can be recast in
the theory of Markov Decision Processes (MDP) by enlarging the state space with the total
discounted cost that has been incurred so far. Numerical solution procedures via linear
programming of these completely observable general risk-sensitive Markov Decision Processes can
be found in [?]. The average cost version of this problem is treated in [7] and for an application
in insurance see [3]. Now we assume that only one of two components of a controlled Markov
process can be observed. However, the cost may also depend on both components, which leads
to the situation that the cost incurred so far is an unobservable quantity. It is well known that
in case of a risk-neutral decision maker, the partially observable problem can be solved by a
completely observable MDP when we enlarge the state space by the conditional distribution of
the unobservable state, given the observable history of the process (see e.g. [1] chapter 5, [12]
chapter 4 or [?] chapter 7). As far as the risk-sensitive problem is concerned we proceed in a
similar way. This time, however, the corresponding problem with complete observation already
possesses an enlarged state consisting of the process state and the total discounted cost so far.
Thus, to cope with the partially observable model we construct a Markov Decision Process where
the state consists of the observable part of the state and the joint conditional distribution of the
unobservable part of the state and unobservable total cost so far, given the observable history
of the process.

Early papers provided rigorous mathematical treatment of POMDPs with Borel state
and action spaces. These references already present the solution procedure via the enlargement
of the state space and the reduction to an ordinary Markov Decision Process. For a detailed
discussion of the theory in the classical risk-neutral setting and for several applications see [1]
chapter 5 and [12] chapter 4. Risk-sensitive Markov Decision Processes with the exponential
utility have been discussed intensively since the seminal paper [14]. For further references
we refer the reader to [2]. Recent applications of this criterion in a wide range of portfolio
optimization problems can be found in [8]. Papers which combine the exponential utility with
POMDPs are among others [?]. In all these papers a control model formulation has
been used, where the true, unobservable and controlled state process is a Markov process (under
Markovian policies) and observations are obtained by perturbed signals of this process. A change
of measure technique is used to obtain independent signals. In order to apply MDP theory, the
state space has been enlarged by a quantity that has been called an 'information vector'. In
the present paper we use a more general model formulation where both parts (observable and
unobservable state) are jointly Markovian and can be controlled jointly. This setting also covers
the Bayesian case where the unknown state part is simply an unknown parameter. Also note
that our optimization criterion is not restricted to the exponential utility and we do not need a
change of measure technique to derive our filter. Moreover, the general approach implies a very
natural interpretation for the 'information vector' in the exponential utility case. Besides [?],
all the previously mentioned papers focus on the risk-sensitive average criterion by using the
vanishing discount approach, i.e., by looking at the $\beta$-discounted problem and by letting
$\beta$ go to 1. In [6] a finite state and action space is considered and emphasis is laid on
numerical aspects of the problem. A discrete-time linear quadratic risk-sensitive stochastic
control problem with incomplete state information is solved in [?].

Our paper is organized as follows: In the next section we introduce the underlying POMDP
and define general history-dependent (deterministic) policies for this model. In Section 3 we
consider the finite horizon general risk-sensitive problem and introduce continuity and
compactness assumptions which will guarantee the existence of optimal policies. Then the problem
is embedded into a suitably defined Markov Decision Process where the state space contains,
among others, a joint conditional distribution of the unobservable state and total accumulated
cost so far, given the observed process. An updating operator is defined to create a forward
iteration of this joint conditional distribution. The main theorem of this section (Theorem 3.3)
states the validity of the embedding procedure and the existence of optimal policies. Section 4
contains some important special cases. Among them is the situation where the cost function
does not depend on the unobservable state, in which case the updating operator simplifies to the
updating operator for classical risk-neutral POMDPs. In case the exponential utility function is
used, we rediscover some results of the previous literature. We also consider the case of a power
utility where we only get a slight simplification. In Section 5 we consider a simple risk-sensitive
Bayesian house selling problem. We prove the existence of so-called 'reservation levels' which
can be seen as thresholds for the acceptance of an offer. These reservation levels depend only
on the conditional distribution. In the last section we consider the problem with infinite time
horizon and distinguish the cases of a convex and a concave utility function, which require
separate proofs due to different inequalities. The main theorems (Theorem 6.1 and Theorem 6.2)
show that the value function of the problem can be obtained from a fixed point equation and that
an optimal policy exists which is not stationary but can still be generated by only one decision
function.


2. General Partially Observable Risk-Sensitive Markov Decision Processes 

We suppose that a partially observable Markov Decision Process is given, which we introduce
as follows: We denote this process by $(X_n, Y_n)_{n\in\mathbb{N}_0}$ and assume that the state
space is $E_X \times E_Y$ where $E_X$ and $E_Y$ are Borel spaces, i.e., Borel subsets of some
Polish spaces. The $x$-component will be the observable part, the $y$-component cannot be observed
by the controller. Actions can be taken from a set $A$ which is again a Borel space. The set
$D \subset E_X \times A$ is a Borel subset of $E_X \times A$. By
$D(x) := \{a \in A : (x,a) \in D\}$ we denote the feasible actions depending on the
observable state part $x$. We assume that $D$ contains the graph of a measurable mapping from
$E_X$ to $A$. There is a stochastic transition kernel $Q$ from $D \times E_Y$ to $E_X \times E_Y$
which determines the distribution of the new state pair given the current state and action. So
$Q(B|x,y,a)$ is the probability that the next state pair is in
$B \in \mathcal{B}(E_X \times E_Y)$, given the current state is $(x,y)$ and action $a \in D(x)$ is
taken. In what follows we assume that the transition kernel $Q$ has a measurable density $q$ with
respect to some $\sigma$-finite measures $\lambda$ and $\nu$, i.e.,

\[ Q(B|x,y,a) = \int_B q(x',y'|x,y,a)\,\lambda(dx')\,\nu(dy'), \quad B \in \mathcal{B}(E_X \times E_Y). \]

For convenience we introduce the marginal transition kernel density by

\[ q^X(x'|x,y,a) := \int_{E_Y} q(x',y'|x,y,a)\,\nu(dy'). \]

We assume that the initial distribution $Q_0$ of $Y_0$ is known. Further we have a measurable
one-stage cost function $c : D \times E_Y \to \mathbb{R}_+$. We assume in particular that the cost
$c(x,y,a)$ also depends on the unknown state part $y$. Finally we have a discount factor
$\beta \in (0,1]$.

Next we introduce policies for the controller. Here it is important to consider the set of
observable histories which are defined as follows:

\[ H_0 := E_X, \qquad H_n := H_{n-1} \times A \times E_X. \]

An element $h_n = (x_0,a_0,x_1,\dots,x_n) \in H_n$ denotes the observable history of the process
up to time $n$.

Definition 2.1. a) A measurable mapping $g_n : H_n \to A$ with the property
$g_n(h_n) \in D(x_n)$ for $h_n \in H_n$ is called a decision rule at stage $n$.
b) A sequence $\pi = (g_0, g_1, \dots)$ where $g_n$ is a decision rule at stage $n$ for all $n$,
is called a policy. We denote by $\Pi$ the set of all policies.

3. Finite Horizon Problems 

In this section we consider problems with finite time horizon $N$. For a fixed policy
$\pi = (g_0, g_1, \dots) \in \Pi$ and fixed (observable) initial state $x \in E_X$, the initial
distribution $Q_0$ together with the transition kernel $Q$ define by a theorem of Ionescu Tulcea a
probability measure $\mathbb{P}^\pi_{xy}$ on $(E_X \times E_Y)^{N+1}$ endowed with the product
$\sigma$-algebra. More precisely, $\mathbb{P}^\pi_{xy}$ is the probability measure under policy
$\pi$ given $X_0 = x$ and $Y_0 = y$. Later we also use the probability measure
$\mathbb{P}^\pi_x(\cdot) := \int \mathbb{P}^\pi_{xy}(\cdot)\,Q_0(dy)$. For
$\omega = (x_0,y_0,\dots,x_N,y_N) \in (E_X \times E_Y)^{N+1}$ we define the random
variables $X_n$ and $Y_n$ in a canonical way by their projections

\[ X_n(\omega) = x_n, \qquad Y_n(\omega) = y_n. \]

If $\pi = (g_0, g_1, \dots) \in \Pi$ is a given policy, we define recursively

\[ A_0 := g_0(X_0), \qquad A_n := g_n(X_0, A_0, X_1, \dots, X_n), \]

the sequence of actions which are chosen successively under policy $\pi$. We assume that the
decision maker is risk averse and has a utility function $U : \mathbb{R}_+ \to \mathbb{R}$ which
is continuous and strictly increasing. The optimization problem is defined as follows. For
$\pi \in \Pi$ and $X_0 = x$ denote

\[ J_{N\pi}(x) := \int_{E_Y} \mathbb{E}^\pi_{xy}\Big[ U\Big( \sum_{k=0}^{N-1} \beta^k c(X_k, Y_k, A_k) \Big) \Big]\, Q_0(dy) \]

and

\[ J_N(x) := \inf_{\pi \in \Pi} J_{N\pi}(x). \tag{3.1} \]

Note that in case $U(x) = x$ we end up with the usual risk-neutral Partially Observable Markov
Decision Process setup (see e.g. [1] chapter 5, [12] chapter 4). Here however, if $U$ is strictly
concave, then $U$ is a utility function and $U^{-1}(J_N(x))$ represents a certainty equivalent.
If $U$ is concave, we can see from (1.1) that the decision maker is risk seeking and if $U$ is
convex, then the decision maker is risk averse.

In what follows we show how to solve this kind of problem by using an embedding technique.
In order to later ensure the existence of integrals and optimal policies we make the following
assumptions (A):

(i) $U : [0, \infty) \to \mathbb{R}$ is continuous and strictly increasing,

(ii) $D(x)$ is compact for all $x \in E_X$,

(iii) $x \mapsto D(x)$ is upper semicontinuous, i.e. for all $x \in E_X$ it holds: If $x_n \to x$
and $a_n \in D(x_n)$ for all $n \in \mathbb{N}$, then $(a_n)$ has an accumulation point in $D(x)$,

(iv) $(x, y, a) \mapsto c(x, y, a)$ is continuous,

(v) $(x, y, x', y', a) \mapsto q(x', y'|x, y, a)$ is continuous and bounded,

(vi) $c$ is bounded, i.e., there exist constants $0 < \underline{c} < \bar{c}$ with
$\underline{c} \le c(x, y, a) \le \bar{c}$.

Remark 3.1. Note that these assumptions are quite strong; however, they include in particular the
case when state and action spaces are finite. (A)(ii)-(v) also ensure the existence of optimal
policies for risk-neutral POMDP.


In [2] we have solved problem (3.1) for the observable case by extending the state space
to include the accumulated cost so far. Now in the unobservable model, neither the state $y$ nor
the accumulated cost so far can be observed, because the cost depends on $y$. Thus, we proceed as
in risk-neutral POMDPs (see e.g. [?]) and consider probability measures $\mu$ on
$E_Y \times \mathbb{R}_+$:

\[ \mu \in \mathbb{P}_b(E_Y \times \mathbb{R}_+) := \big\{ \mu \text{ is a probability measure on the } \sigma\text{-algebra } \mathcal{B}(E_Y \times \mathbb{R}_+) \text{ such that there exists a constant } K = K(\mu) > 0 \text{ with } \mu(E_Y \times [0, K]) = 1 \big\}. \]

$\mu$ plays the role of the conditional distribution on the larger state space of hidden state
component and accumulated cost. The precise interpretation will be seen in Theorem 3.2. In order
to solve the optimization problem, we need, as in the risk-neutral case, an updating procedure for
the conditional distributions which generates the filter process. The following updating operator
$\Psi : E_X \times A \times E_X \times \mathbb{P}_b(E_Y \times \mathbb{R}_+) \times \mathbb{R}_+ \to \mathbb{P}_b(E_Y \times \mathbb{R}_+)$
will do the task:

\[ \Psi(x, a, x', \mu, z)(B) := \frac{\int_{E_Y} \int_{\mathbb{R}_+} \Big( \int_B q(x', y'|x, y, a)\, \nu(dy')\, \delta_{s + z c(x,y,a)}(ds') \Big) \mu(dy, ds)}{\int_{E_Y} q^X(x'|x, y, a)\, \mu^Y(dy)} \tag{3.2} \]

where $B \in \mathcal{B}(E_Y \times \mathbb{R}_+)$ and $\mu^Y(dy) := \mu(dy, \mathbb{R}_+)$ is the
$Y$-marginal distribution of $\mu$. Later we will also need the $S$-marginal
$\mu^S(ds) := \mu(E_Y, ds)$. We define the updating operator only when the denominator is
positive. For $n \in \mathbb{N}$, $h_n := (x_0, a_0, \dots, x_n)$ and
$B \in \mathcal{B}(E_Y \times \mathbb{R}_+)$ define now a sequence of probability measures

\[ \mu_0(B|h_0) := (Q_0 \otimes \delta_0)(B), \qquad \mu_{n+1}(B|h_n, a, x') := \Psi\big(x_n, a, x', \mu_n(\cdot|h_n), \beta^n\big)(B). \tag{3.3} \]
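For finite state spaces (with $\lambda$ and $\nu$ taken as counting measures) the updating operator (3.2) and the recursion (3.3) reduce to sums. The following is a minimal sketch under that assumption; the dictionary representation of $\mu$ and the names `psi` and `filter_sequence` are illustrative choices, not part of the paper.

```python
from collections import defaultdict

def psi(x, a, x_next, mu, z, q, c, Y):
    """One step of the updating operator (3.2) on finite spaces.

    mu maps pairs (y, s) to probabilities; q(x2, y2, x, y, a) is the joint
    transition probability; c(x, y, a) is the one-stage cost; Y lists the
    hidden states. Returns None when the denominator vanishes.
    """
    numerator = defaultdict(float)
    for (y, s), p in mu.items():
        s_next = s + z * c(x, y, a)        # Dirac mass at s + z*c(x, y, a)
        for y_next in Y:
            numerator[(y_next, s_next)] += p * q(x_next, y_next, x, y, a)
    # Summing the numerator over all (y', s') gives the integral of q^X against mu^Y.
    denominator = sum(numerator.values())
    if denominator <= 0.0:
        return None
    return {key: w / denominator for key, w in numerator.items() if w > 0.0}

def filter_sequence(history, Q0, beta, q, c, Y):
    """Run recursion (3.3) along an observable history [x0, a0, x1, a1, ..., xn]."""
    mu = {(y, 0.0): p for y, p in Q0.items() if p > 0.0}   # mu_0 = Q_0 (x) delta_0
    xs, acts = history[0::2], history[1::2]
    for n, a in enumerate(acts):
        mu = psi(xs[n], a, xs[n + 1], mu, beta ** n, q, c, Y)
        if mu is None:
            raise ValueError("history has probability zero")
    return mu
```

The accumulated cost values are used as floating-point dictionary keys, which is adequate for an illustration; a production implementation would probably discretise or round them.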


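The next theorem shows that the sequence of probability measures $(\mu_n)$ has the intended
interpretation. For this purpose define the random variables

\[ S_0 := 0, \qquad S_n := \sum_{k=0}^{n-1} \beta^k c(X_k, Y_k, A_k), \quad n \in \mathbb{N}. \]

We then obtain:

Theorem 3.2. Suppose $(\mu_n)$ is given by the recursion (3.3). For $n \in \mathbb{N}_0$ and all
$\pi \in \Pi$ it holds that

\[ \mathbb{P}^\pi_x\big( (Y_n, S_n) \in B \mid X_0, A_0, \dots, X_n \big) = \mu_n(B \mid X_0, A_0, \dots, X_n) \quad \mathbb{P}^\pi_x\text{-a.s., for } B \in \mathcal{B}(E_Y \times \mathbb{R}_+). \]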
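Proof. Recall that $\mathbb{P}^\pi_x(\cdot) := \int \mathbb{P}^\pi_{xy}(\cdot)\,Q_0(dy)$. We first
show that

\[ \mathbb{E}^\pi_x\big[ v(X_0, A_0, X_1, \dots, X_n, Y_n, S_n) \big] = \mathbb{E}^\pi_x\big[ v'(X_0, A_0, X_1, \dots, X_n) \big] \tag{3.4} \]

for all bounded and measurable $v : H_n \times E_Y \times \mathbb{R}_+ \to \mathbb{R}$ and

\[ v'(h_n) := \int_{E_Y} \int_{\mathbb{R}_+} v(h_n, y_n, s_n)\, \mu_n(dy_n, ds_n|h_n). \]

We do this by induction. For $n = 0$ both sides reduce to $\int v(x, y, 0)\,Q_0(dy)$. Now suppose
the statement is true for $n - 1$. We simply write $g_n$ instead of $g_n(h_n)$. We obtain for the
left-hand side with a given observable history $h_{n-1}$:

\begin{align*}
\mathbb{E}^\pi_x\big[ v(h_{n-1}, A_{n-1}, X_n, Y_n, S_n) \big]
&= \int_{E_Y} \int_{\mathbb{R}_+} \mu_{n-1}(dy_{n-1}, ds_{n-1}|h_{n-1}) \int_{E_Y} \int_{E_X} \nu(dy_n)\,\lambda(dx_n)\, q(x_n, y_n|x_{n-1}, y_{n-1}, g_{n-1}) \\
&\qquad \cdot \int_{\mathbb{R}_+} \delta_{s_{n-1} + \beta^{n-1} c(x_{n-1}, y_{n-1}, g_{n-1})}(ds_n)\, v(h_{n-1}, g_{n-1}, x_n, y_n, s_n) \\
&= \int_{E_Y} \int_{\mathbb{R}_+} \mu_{n-1}(dy_{n-1}, ds_{n-1}|h_{n-1}) \int_{E_Y} \int_{E_X} \nu(dy_n)\,\lambda(dx_n)\, q(x_n, y_n|x_{n-1}, y_{n-1}, g_{n-1}) \\
&\qquad \cdot v\big(h_{n-1}, g_{n-1}, x_n, y_n, s_{n-1} + \beta^{n-1} c(x_{n-1}, y_{n-1}, g_{n-1})\big).
\end{align*}

For the right-hand side we obtain (where we insert the recursion for $\mu_n$ in the third equation
and use Fubini's theorem, so that the normalizing constant of $\mu_n$ cancels out):

\begin{align*}
\mathbb{E}^\pi_x\big[ v'(h_{n-1}, A_{n-1}, X_n) \big]
&= \int_{E_Y} \int_{\mathbb{R}_+} \mu_{n-1}(dy_{n-1}, ds_{n-1}|h_{n-1}) \int_{E_X} \lambda(dx_n)\, q^X(x_n|x_{n-1}, y_{n-1}, g_{n-1})\, v'(h_{n-1}, g_{n-1}, x_n) \\
&= \int_{E_Y} \mu^Y_{n-1}(dy_{n-1}|h_{n-1}) \int_{E_X} \lambda(dx_n)\, q^X(x_n|x_{n-1}, y_{n-1}, g_{n-1}) \\
&\qquad \cdot \int_{E_Y} \int_{\mathbb{R}_+} \mu_n(dy_n, ds_n|h_n)\, v(h_{n-1}, g_{n-1}, x_n, y_n, s_n) \\
&= \int_{E_Y} \int_{E_X} \nu(dy_n)\,\lambda(dx_n) \int_{E_Y} \int_{\mathbb{R}_+} \mu_{n-1}(dy_{n-1}, ds_{n-1}|h_{n-1})\, q(x_n, y_n|x_{n-1}, y_{n-1}, g_{n-1}) \\
&\qquad \cdot \int_{\mathbb{R}_+} \delta_{s_{n-1} + \beta^{n-1} c(x_{n-1}, y_{n-1}, g_{n-1})}(ds_n)\, v(h_{n-1}, g_{n-1}, x_n, y_n, s_n) \\
&= \int_{E_Y} \int_{\mathbb{R}_+} \mu_{n-1}(dy_{n-1}, ds_{n-1}|h_{n-1}) \int_{E_Y} \int_{E_X} \nu(dy_n)\,\lambda(dx_n)\, q(x_n, y_n|x_{n-1}, y_{n-1}, g_{n-1}) \\
&\qquad \cdot v\big(h_{n-1}, g_{n-1}, x_n, y_n, s_{n-1} + \beta^{n-1} c(x_{n-1}, y_{n-1}, g_{n-1})\big).
\end{align*}

Thus equation (3.4) is proved. It implies in particular for $v = 1_{B \times C}$ with
$B \in \mathcal{B}(E_Y \times \mathbb{R}_+)$ and $C \subset E_X \times A \times \dots \times E_X$
a measurable set of histories until time $n$ that

\[ \mathbb{P}^\pi_x\big( (Y_n, S_n) \in B,\ (X_0, A_0, \dots, X_n) \in C \big) = \mathbb{E}^\pi_x\big[ \mu_n(B|X_0, A_0, \dots, X_n)\, 1_C\big((X_0, A_0, \dots, X_n)\big) \big]. \]

This in turn yields by definition that $\mu_n(B|X_0, A_0, \dots, X_n)$ is a conditional
$\mathbb{P}^\pi_x$-distribution of $(Y_n, S_n)$ given the history $(X_0, A_0, \dots, X_n)$. □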


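Now we turn again to the optimization problem (3.1). Motivated by the previous result we
define for $x \in E_X$, $\mu \in \mathbb{P}_b(E_Y \times \mathbb{R}_+)$, $z \in (0,1]$ and
$n = 1, \dots, N$:

\[ V_{n\pi}(x, \mu, z) := \int_{E_Y} \int_{\mathbb{R}_+} \mathbb{E}^\pi_{xy}\Big[ U\Big( s + z \sum_{k=0}^{n-1} \beta^k c(X_k, Y_k, A_k) \Big) \Big]\, \mu(dy, ds), \tag{3.5} \]

\[ V_n(x, \mu, z) := \inf_{\pi \in \Pi} V_{n\pi}(x, \mu, z). \tag{3.6} \]

Obviously we have that $J_N(x) = V_N(x, Q_0 \otimes \delta_0, 1)$ where $\delta_x$ is the Dirac measure at the point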


$x \in \mathbb{R}$. However, problem (3.6) can be solved with the general theory of POMDP and [?]
by defining a suitable MDP. For this purpose let us define for a probability measure
$\mu \in \mathbb{P}(E_Y)$

\[ Q^X(B|x, \mu, a) := \int_B \int_{E_Y} q^X(x'|x, y, a)\, \mu(dy)\, \lambda(dx'), \quad B \in \mathcal{B}(E_X). \]

We consider a Markov Decision Process with state space
$E := E_X \times \mathbb{P}_b(E_Y \times \mathbb{R}_+) \times (0,1]$, action space $A$ and
admissible actions given by the set $D$. The one-stage cost is zero and the terminal cost function
is $V_0(x, \mu, z) := \int_{E_Y} \int_{\mathbb{R}_+} U(s)\, \mu(dy, ds)$. Note that for all
$\mu \in \mathbb{P}_b(E_Y \times \mathbb{R}_+)$ the expectation is well-defined since the support
of $\mu$ in the $s$-component is a compact set. The transition law is given by
$\tilde{Q}(\cdot|x, \mu, z, a)$ which is for $(x, \mu, z, a) \in E \times A$, $a \in D(x)$ and a
measurable subset $B \subset E$ defined by

\[ \tilde{Q}(B|x, \mu, z, a) := \int_{E_X} 1_B\big( (x', \Psi(x, a, x', \mu, z), \beta z) \big)\, Q^X(dx'|x, \mu^Y, a). \]

Note that $\tilde{Q}$ is again a transition kernel. Decision rules in the MDP setting are given by
measurable mappings $f : E \to A$ such that $f(x, \mu, z) \in D(x)$. We denote by $F$ the set of
decision rules and by $\Pi^M$ the set of Markov policies $\pi = (f_0, f_1, \dots)$ with
$f_n \in F$. Note that 'Markov' refers to the fact that the decision at time $n$ depends only on
$x$, $\mu$ and $z$. Further note that we have $\Pi^M \subset \Pi$ in the following sense: For every
$\pi = (f_0, f_1, \dots) \in \Pi^M$ we find a $\sigma = (g_0, g_1, \dots) \in \Pi$ such that

\[ g_0(x_0) := f_0(x_0, \mu_0, 1), \qquad g_n(h_n) := f_n\big(x_n, \mu_n(\cdot|h_n), \beta^n\big), \quad n \in \mathbb{N}. \]

With this interpretation $V_{n\pi}$ is also defined for $\pi \in \Pi^M$.
Let us now introduce the set

\[ C(E) := \big\{ v : E \to \mathbb{R} : v \text{ is lower semicontinuous and } v \ge V_0 \big\}, \]

where we use the topology of weak convergence on $\mathbb{P}_b(E_Y \times \mathbb{R}_+)$. For
$v \in C(E)$ and $f \in F$ we consider the operator

\[ (T_f v)(x, \mu, z) := \int_{E_X} v\big( x', \Psi(x, f(x, \mu, z), x', \mu, z), \beta z \big)\, Q^X\big( dx'|x, \mu^Y, f(x, \mu, z) \big), \quad (x, \mu, z) \in E, \]

which is well-defined. The minimal cost operator of this Markov Decision Model is given by

\[ (Tv)(x, \mu, z) := \inf_{a \in D(x)} \int_{E_X} v\big( x', \Psi(x, a, x', \mu, z), \beta z \big)\, Q^X(dx'|x, \mu^Y, a), \quad (x, \mu, z) \in E, \tag{3.7} \]

which is again well-defined, and $T_f V_0 \ge T V_0 \ge V_0$ (see also the proof below). Note that
$V_0 \in C(E)$. If a decision rule $f \in F$ is such that $T_f v = Tv$, then $f$ is called a
minimizer of $v$. We obtain:

Theorem 3.3. It holds that

a) For a policy $\pi = (f_0, f_1, f_2, \dots) \in \Pi^M$ we have the following cost iteration:

\[ V_{n\pi} = T_{f_0} \cdots T_{f_{n-1}} V_0 \quad \text{for } n = 1, \dots, N. \]

b) $V_n \in C(E)$ and $V_n = T V_{n-1}$ for $n = 1, \dots, N$, i.e.,

\[ V_{n+1}(x, \mu, z) = \inf_{a \in D(x)} \int_{E_X} V_n\big( x', \Psi(x, a, x', \mu, z), \beta z \big)\, Q^X(dx'|x, \mu^Y, a), \quad (x, \mu, z) \in E. \]

The value function of (3.1) is then given by $J_N(x) = V_N(x, Q_0 \otimes \delta_0, 1)$.

c) For every $n = 1, \dots, N$ there exists a minimizer $f_n^* \in F$ of $V_{n-1}$ and
$(g_0^*, \dots, g_{N-1}^*)$ with

\[ g_n^*(h_n) := f_{N-n}^*\big( x_n, \mu_n(\cdot|h_n), \beta^n \big), \quad n = 0, \dots, N-1, \]

is an optimal policy for problem (3.1). Note that the optimal policy consists of decision
rules which depend on the current state and the current joint conditional distribution of
accumulated cost and hidden state.
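For a small finite model the value iteration of Theorem 3.3 b) can be compared with a brute-force minimization over history-dependent policies. The sketch below assumes finite spaces with counting measures and a two-stage horizon for the brute-force check; `make_solver` and `brute_force_two_stage` are hypothetical helper names introduced only for this illustration.

```python
import itertools
from collections import defaultdict

def make_solver(X, Y, A, q, c, U, beta):
    """Return V(n, x, mu, z) implementing V_n = T V_{n-1} from Theorem 3.3 b).

    mu is a tuple of ((y, s), prob) pairs; q(x2, y2, x, y, a) is the joint
    transition probability and c(x, y, a) the one-stage cost.
    """
    def step(x, a, mu, z):
        # Unnormalised numerators of Psi for every next observable state x'.
        out = {x2: defaultdict(float) for x2 in X}
        for (y, s), p in mu:
            for x2 in X:
                for y2 in Y:
                    out[x2][(y2, s + z * c(x, y, a))] += p * q(x2, y2, x, y, a)
        return out

    def V(n, x, mu, z):
        if n == 0:
            return sum(p * U(s) for (_, s), p in mu)      # terminal cost V_0
        best = float("inf")
        for a in A:
            total = 0.0
            for x2, num in step(x, a, mu, z).items():
                weight = sum(num.values())                # Q^X({x'} | x, mu^Y, a)
                if weight > 0.0:
                    mu2 = tuple((k, w / weight) for k, w in num.items())
                    total += weight * V(n - 1, x2, mu2, beta * z)
            best = min(best, total)
        return best
    return V

def brute_force_two_stage(x0, X, Y, A, q, c, U, beta, Q0):
    """Minimise J_{2,pi}(x0) over all deterministic history-dependent policies."""
    best = float("inf")
    for a0 in A:
        for rule in itertools.product(A, repeat=len(X)):  # a1 as a function of x1
            a1 = dict(zip(X, rule))
            value = sum(
                Q0[y0] * q(x1, y1, x0, y0, a0)
                * U(c(x0, y0, a0) + beta * c(x1, y1, a1[x1]))
                for y0 in Y for x1 in X for y1 in Y
            )
            best = min(best, value)
    return best
```

Starting the recursion from `mu0 = tuple(((y, 0.0), p) for y, p in Q0.items())` and `z = 1.0` corresponds to $J_N(x) = V_N(x, Q_0 \otimes \delta_0, 1)$. The recursion is exponential in the horizon, so it is only meant for toy instances.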

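Proof. The proof of part a) is by induction. For $n = 1$ we obtain with $a := f_0(x, \mu, z)$:

\begin{align*}
T_{f_0} V_0(x, \mu, z) &= \int_{E_X} V_0\big( x', \Psi(x, a, x', \mu, z), \beta z \big)\, Q^X(dx'|x, \mu^Y, a) \\
&= \int_{E_Y} \int_{\mathbb{R}_+} \int_{E_X} \int_{\mathbb{R}_+} U(s')\, \delta_{s + z c(x,y,a)}(ds')\, q^X(x'|x, y, a)\, \lambda(dx')\, \mu(dy, ds) \\
&= \int_{E_Y} \int_{\mathbb{R}_+} U\big( s + z c(x, y, a) \big)\, \mu(dy, ds) \\
&= V_{1\pi}(x, \mu, z).
\end{align*}

Suppose the statement is true for $V_{n\pi}$. In order to ease notation we denote for a policy
$\pi = (f_0, f_1, f_2, \dots) \in \Pi^M$ by $\vec{\pi} = (f_1, f_2, \dots)$ the shifted policy.
Moreover let again $a := f_0(x, \mu, z)$. Then

\begin{align*}
T_{f_0} V_{n\vec{\pi}}(x, \mu, z)
&= \int_{E_X} V_{n\vec{\pi}}\big( x', \Psi(x, a, x', \mu, z), \beta z \big)\, Q^X(dx'|x, \mu^Y, a) \\
&= \int_{E_X} \int_{E_Y} \int_{\mathbb{R}_+} \mathbb{E}^{\vec{\pi}}_{x'y'}\Big[ U\Big( s' + \beta z \sum_{k=0}^{n-1} \beta^k c(X_k, Y_k, A_k) \Big) \Big]\, \Psi(x, a, x', \mu, z)(dy', ds')\, Q^X(dx'|x, \mu^Y, a) \\
&= \int_{E_X} \int_{E_Y} \int_{\mathbb{R}_+} \mathbb{E}^{\pi}\Big[ U\Big( s' + z \sum_{k=1}^{n} \beta^k c(X_k, Y_k, A_k) \Big) \,\Big|\, X_1 = x', Y_1 = y' \Big]\, \Psi(x, a, x', \mu, z)(dy', ds')\, Q^X(dx'|x, \mu^Y, a) \\
&= \int_{E_Y} \int_{E_Y} \int_{E_X} \int_{\mathbb{R}_+} \int_{\mathbb{R}_+} \mathbb{E}^{\pi}\Big[ U\Big( s' + z \sum_{k=1}^{n} \beta^k c(X_k, Y_k, A_k) \Big) \,\Big|\, X_1 = x', Y_1 = y' \Big] \\
&\qquad \cdot q(x', y'|x, y, a)\, \delta_{s + z c(x,y,a)}(ds')\, \mu(dy, ds)\, \nu(dy')\, \lambda(dx') \\
&= \int_{E_Y} \int_{E_Y} \int_{E_X} \int_{\mathbb{R}_+} \mathbb{E}^{\pi}\Big[ U\Big( s + z c(x, y, a) + z \sum_{k=1}^{n} \beta^k c(X_k, Y_k, A_k) \Big) \,\Big|\, X_1 = x', Y_1 = y' \Big] \\
&\qquad \cdot q(x', y'|x, y, a)\, \mu(dy, ds)\, \nu(dy')\, \lambda(dx') \\
&= \int_{E_Y} \int_{\mathbb{R}_+} \mathbb{E}^{\pi}_{xy}\Big[ U\Big( s + z \sum_{k=0}^{n} \beta^k c(X_k, Y_k, A_k) \Big) \Big]\, \mu(dy, ds) \\
&= V_{n+1\,\pi}(x, \mu, z)
\end{align*}

and the statement in part a) is shown.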

Next we prove parts b) and c) together. From part a) it follows that for $\pi \in \Pi^M$, the
value functions in problem (3.6) indeed coincide with the value functions of the previously
defined MDP. From MDP theory it follows in particular that it is enough to consider Markov
policies $\Pi^M$, i.e., $V_n = \inf_{\pi \in \Pi} V_{n\pi} = \inf_{\pi \in \Pi^M} V_{n\pi}$ (see
e.g. [13] Theorem 18.4). Next consider functions $v \in C(E)$. We show that $Tv \in C(E)$ and that
there exists a minimizer for $v$. Statements b) and c) then follow from Theorem 2.3.8 in [1].

We start by proving that $Q^X(\cdot|x, \mu^Y, a)$ is weakly continuous, i.e., we have to show that

\[ (x, \mu, a) \mapsto \int v(x')\, Q^X(dx'|x, \mu^Y, a) \tag{3.8} \]

is continuous for all $v \in C_b(E_X)$ where $C_b(E_X)$ is the set of bounded, continuous functions
on $E_X$. Obviously $\mu_n \Rightarrow \mu$ implies that $\mu_n^Y \Rightarrow \mu^Y$ where
$\Rightarrow$ denotes weak convergence. From our standing assumption (A)(v) it follows that
$Q(\cdot|x, y, a)$ is weakly continuous. Hence we obtain from Theorem 17.11 in [13] that the
function in (3.8) is continuous.

Next we show that

\[ (x, a, x', \mu, z) \mapsto \Psi(x, a, x', \mu, z) \]

is continuous at all points where $\Psi$ is defined, i.e., if $(x_n, a_n, x_n', \mu_n, z_n)$
converges to $(x, a, x', \mu, z)$ in
$E_X \times A \times E_X \times \mathbb{P}_b(E_Y \times \mathbb{R}_+) \times (0,1]$ it follows that
$\Psi(x_n, a_n, x_n', \mu_n, z_n) \Rightarrow \Psi(x, a, x', \mu, z)$, where
$(x_n, a_n, x_n', \mu_n, z_n)$ and $(x, a, x', \mu, z)$ are such that
$\int q^X(x_n'|x_n, y, a_n)\, \mu_n^Y(dy) > 0$ and $\int q^X(x'|x, y, a)\, \mu^Y(dy) > 0$. Hence for
$v \in C_b(E_Y \times \mathbb{R}_+)$ consider

\[ \int_{E_Y} \int_{\mathbb{R}_+} v(y', s')\, \Psi(x, a, x', \mu, z)(dy', ds'). \]

If we plug in the definition of $\Psi$ we get a quotient whose numerator and denominator will be
investigated separately. For the numerator we obtain

\[ \int_{E_Y} \int_{\mathbb{R}_+} \int_{E_Y} v\big( y', s + z c(x, y, a) \big)\, q(x', y'|x, y, a)\, \nu(dy')\, \mu(dy, ds) \]

which is continuous by assumption (A)(iv,v) and Theorem 17.11 in [13]. The denominator

\[ \int_{E_Y} q^X(x'|x, y, a)\, \mu^Y(dy) \]

is continuous in $(x, a, x', \mu)$ by the same reasoning. Hence $\Psi$ is continuous.

Now suppose $v \in C(E)$. Taking into account assumption (A), it obviously follows that
$(x, x', a, \mu, z) \mapsto v\big( x', \Psi(x, a, x', \mu, z), \beta z \big)$ is lower
semicontinuous. Again we apply Theorem 17.11 in [13] to obtain that
$(x, \mu, z, a) \mapsto \int v\big( x', \Psi(x, a, x', \mu, z), \beta z \big)\, Q^X(dx'|x, \mu^Y, a)$
is lower semicontinuous. Note here that continuity of $\Psi$ at those points where the denominator
is positive is sufficient, since the other points form a null set. By Proposition 2.4.3 in [1] it
follows that $(x, \mu, z) \mapsto (Tv)(x, \mu, z)$ is lower semicontinuous and there exists a
minimizer of $v$.

The inequality $Tv \ge V_0$ is obtained from

\begin{align*}
\int_{E_X} v\big( x', \Psi(x, a, x', \mu, z), \beta z \big)\, Q^X(dx'|x, \mu^Y, a)
&\ge \int_{E_X} \int_{\mathbb{R}_+} U(s')\, \Psi^S(x, a, x', \mu, z)(ds')\, Q^X(dx'|x, \mu^Y, a) \\
&= \int_{E_Y} \int_{\mathbb{R}_+} U\big( s + z c(x, y, a) \big) \int_{E_X} q^X(x'|x, y, a)\, \lambda(dx')\, \mu(dy, ds) \\
&\ge \int_{E_Y} \int_{\mathbb{R}_+} U(s)\, \mu(dy, ds) = V_0(x, \mu, z)
\end{align*}

which implies the statement. □


Remark 3.4. Note that $\mu \mapsto V_{n\pi}(x, \mu, z)$ is by definition a linear mapping and thus
$\mu \mapsto V_n(x, \mu, z)$ is concave, being an infimum of linear mappings.

Remark 3.5. Since $V_0 \in C(E)$, $T V_0 \ge V_0$ and since the $T$-operator is monotone,
$V_n = T^n V_0$ is increasing in $n$.


Remark 3.6. Of course instead of minimizing cost one could also consider the problem of
maximizing reward. Suppose that $r : D \times E_Y \to [\underline{r}, \bar{r}]$ (with
$0 < \underline{r} < \bar{r}$) is a one-stage reward function and the problem is

\[ J_N(x) := \sup_{\sigma \in \Pi} \int_{E_Y} \mathbb{E}^\sigma_{xy}\Big[ U\Big( \sum_{k=0}^{N-1} \beta^k r(X_k, Y_k, A_k) \Big) \Big]\, Q_0(dy), \quad x \in E_X. \tag{3.9} \]

It is possible to treat this problem in exactly the same way using straightforward modifications.


4. Some Special Cases 


4.1. The cost function does not depend on the hidden state. An important special
case is obtained when the one-stage cost function does not depend on the hidden state $y$, i.e.,
$c(x, y, a) = c(x, a)$. In this case the cost which has accumulated so far is always observable.
The recursion for the joint conditional distribution $\mu_n(\cdot|h_n)$ of cost and hidden state
simplifies considerably. In order to explain this, we define the operator
$\Phi : E_X \times A \times E_X \times \mathbb{P}(E_Y) \to \mathbb{P}(E_Y)$ by

\[ \Phi(x, a, x', \mu)(B) := \frac{\int_B \int_{E_Y} q(x', y'|x, y, a)\, \mu(dy)\, \nu(dy')}{\int_{E_Y} q^X(x'|x, y, a)\, \mu(dy)}, \quad B \in \mathcal{B}(E_Y). \]

Note that $\Phi$ is exactly the usual updating (Bayesian) operator which appears in classical
POMDP (see e.g. [1], section 5.2). It updates the conditional probability of the unobservable
state. In what follows denote by $(\mu_n^\Phi)$ the sequence of probability measures on $E_Y$
generated by $\Phi$ with $\mu_0^\Phi := Q_0$. Then we obtain:


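Proposition 4.1. Suppose $c(x, y, a) = c(x, a)$ is independent of $y$. Then $\mu_n(\cdot|h_n)$
from (3.3) can be written as

\[ \mu_n(B_1 \times B_2|h_n) = \mu_n^Y(B_1|h_n) \cdot \mu_n^S(B_2|h_n), \quad \text{where } B_1 \times B_2 \in \mathcal{B}(E_Y \times \mathbb{R}_+), \tag{4.1} \]

with $\mu_n^Y(\cdot|h_n) = \mu_n^\Phi(\cdot|h_n)$.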
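Proof. The proof is by induction on $n$. The statement for $n = 0$ is true by definition. Now
suppose the statement is true for $n$. We obtain with $h_{n+1} = (h_n, a_n, x')$, $x_n = x$ and
$a_n = a$:

\begin{align*}
\mu_{n+1}(B_1 \times B_2|h_{n+1})
&= \frac{\int_{E_Y} \int_{\mathbb{R}_+} \int_{B_1} \int_{B_2} q(x', y'|x, y, a)\, \nu(dy')\, \delta_{s + \beta^n c(x,a)}(ds')\, \mu_n^Y(dy|h_n)\, \mu_n^S(ds|h_n)}{\int_{E_Y} q^X(x'|x, y, a)\, \mu_n^Y(dy|h_n)} \\
&= \frac{\int_{B_1} \int_{E_Y} q(x', y'|x, y, a)\, \mu_n^Y(dy|h_n)\, \nu(dy')}{\int_{E_Y} q^X(x'|x, y, a)\, \mu_n^Y(dy|h_n)} \cdot \int_{\mathbb{R}_+} \delta_{s + \beta^n c(x,a)}(B_2)\, \mu_n^S(ds|h_n) \\
&= \Phi\big( x, a, x', \mu_n^Y(\cdot|h_n) \big)(B_1) \cdot \delta_{\sum_{k=0}^{n} \beta^k c(x_k, a_k)}(B_2).
\end{align*}

Noting that $\mu_n^Y(\cdot|h_n) = \mu_n^\Phi(\cdot|h_n)$ by the induction hypothesis, the statement
follows. □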


Thus, the problem simplifies considerably since instead of probability measures on
$\mathcal{B}(E_Y \times \mathbb{R}_+)$ we only need to consider probability measures on
$\mathcal{B}(E_Y)$ together with an observable sequence of accumulated cost. We can interpret the
embedding MDP as one with state space
$E_X \times \mathbb{P}(E_Y) \times \mathbb{R}_+ \times (0,1]$ and the value iteration reads

\begin{align*}
V_0(x, \mu, s, z) &:= U(s), \\
V_{n+1}(x, \mu, s, z) &= \inf_{a \in D(x)} \int V_n\big( x', \Phi(x, a, x', \mu), s + z c(x, a), \beta z \big)\, Q^X(dx'|x, \mu, a)
\end{align*}

for $(x, \mu, s, z) \in E_X \times \mathbb{P}(E_Y) \times \mathbb{R}_+ \times (0,1]$,
where $\Phi$ has been defined in the previous calculation.


Remark 4.2. In case there is no unobservable component, i.e., we have a completely observable
risk-sensitive MDP, the updating operator
$\Psi : E_X \times A \times E_X \times \mathbb{P}(\mathbb{R}_+) \times (0,1] \to \mathbb{P}(\mathbb{R}_+)$
boils down to

\[ \Psi(x, a, x', \mu, z)(B) = \int_{\mathbb{R}_+} \delta_{s + z c(x,a)}(B)\, \mu(ds), \quad B \in \mathcal{B}(\mathbb{R}_+), \]

and we obtain $\mu_n(B|h_n) = \delta_{\sum_{k=0}^{n-1} \beta^k c(x_k, a_k)}(B)$. Hence the updating
process is deterministic and instead of $\mu$ we can simply store the accumulated cost so far. The
value iteration then reads

\begin{align*}
V_0(x, s, z) &= U(s), \quad (x, s, z) \in E_X \times \mathbb{R}_+ \times (0,1], \\
V_{n+1}(x, s, z) &= \inf_{a \in D(x)} \int V_n\big( x', s + z c(x, a), z\beta \big)\, Q(dx'|x, a),
\end{align*}

which is exactly the situation which has been investigated in [2].
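The completely observable recursion of Remark 4.2 is easy to implement for a finite model. The following is a minimal sketch, assuming finite state and action sets and a transition probability function `P(x2, x, a)`; the name `solve_observable` is introduced here for illustration only.

```python
from functools import lru_cache

def solve_observable(X, A, P, c, U, beta):
    """Value iteration of Remark 4.2 for a completely observable finite model.

    P(x2, x, a) is the transition probability and c(x, a) the one-stage cost.
    Returns a function V(n, x, s, z) on the extended state (x, s, z), where s
    is the accumulated cost so far and z the current discount weight.
    """
    @lru_cache(maxsize=None)
    def V(n, x, s, z):
        if n == 0:
            return U(s)                                   # V_0(x, s, z) = U(s)
        return min(
            sum(P(x2, x, a) * V(n - 1, x2, s + z * c(x, a), z * beta) for x2 in X)
            for a in A
        )
    return V
```

With $U(s) = s$ the function reduces to `s + z * W_n(x)`, where `W_n` is the usual risk-neutral discounted value function, which gives a simple sanity check.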


4.2. A particular class of partially observable control models. The transition law of the
process $(X_n, Y_n)$ we consider here is quite general. For other general models see Chapter
4 in [?]. All these general models contain in particular the following class which appears very
often in applications (in particular this is the starting point in [?]):

\begin{align*}
X_{n+1} &= h(Y_n) + \eta_{n+1}, \\
Y_{n+1} &= b(Y_n, A_n) + \varepsilon_{n+1},
\end{align*}

where $(\varepsilon_n)$ is a sequence of independent and identically distributed random variables
with density $\varphi_\varepsilon$ and $(\eta_n)$ is a sequence of independent and identically
distributed random variables with density $\varphi_\eta$. Both sequences are assumed to be
independent and we assume for simplicity that $E_X = E_Y = \mathbb{R}$. We consider here an
additive noise but this can also be part of the functions $b$ and $h$ respectively. The transition
law under a policy $\pi$ is for $B_1, B_2 \in \mathcal{B}(\mathbb{R})$ given by

\begin{align*}
Q(B_1 \times B_2|x, y, a) &= \mathbb{P}\big( X_{n+1} \in B_1, Y_{n+1} \in B_2 \mid X_n = x, Y_n = y, A_n = a \big) \\
&= \mathbb{P}\big( h(y) + \eta_{n+1} \in B_1,\ b(y, a) + \varepsilon_{n+1} \in B_2 \big) \\
&= \int_{B_1} \varphi_\eta\big( w - h(y) \big)\, dw \int_{B_2} \varphi_\varepsilon\big( v - b(y, a) \big)\, dv.
\end{align*}

According to assumption (A)(v) the resulting density $q$ has to be continuous and bounded in
all variables. This is for example satisfied if $b$, $h$ are continuous and $\varphi_\varepsilon$,
$\varphi_\eta$ are continuous and bounded densities, like e.g. the Gaussian density.
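As an illustration of this model class, the joint density $q$ and its marginal $q^X$ can be written down directly when both noise terms are Gaussian. The sketch below assumes centred Gaussian noise with standard deviations `sigma_eta` and `sigma_eps` (parameters introduced here, not fixed by the text) and approximates the marginal by a trapezoidal rule on a truncated interval.

```python
import math

def gaussian_pdf(u, sigma):
    """Density of a centred normal distribution with standard deviation sigma."""
    return math.exp(-0.5 * (u / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def make_density(h, b, sigma_eta, sigma_eps):
    """Joint density q(x', y' | x, y, a) for X' = h(y) + eta, Y' = b(y, a) + eps."""
    def q(x_next, y_next, x, y, a):
        return (gaussian_pdf(x_next - h(y), sigma_eta)
                * gaussian_pdf(y_next - b(y, a), sigma_eps))
    return q

def marginal_density(q, x_next, x, y, a, lo=-10.0, hi=10.0, steps=2001):
    """q^X(x' | x, y, a): integrate q over y' in [lo, hi] with the trapezoidal rule."""
    width = (hi - lo) / (steps - 1)
    vals = [q(x_next, lo + i * width, x, y, a) for i in range(steps)]
    return width * (sum(vals) - 0.5 * (vals[0] + vals[-1]))
```

In this additive model the marginal does not depend on the action, since the observation $X_{n+1}$ is driven by $Y_n$ only; the numerical integral should therefore agree with `gaussian_pdf(x_next - h(y), sigma_eta)` up to truncation error.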

4.3. Total costs criterion. In case $\beta = 1$, the costs are not discounted and we minimize the
utility of the total costs

\[ \sum_{k=0}^{N-1} c(X_k, Y_k, A_k). \]

In this case the $z$-component of the iteration in Theorem 3.3 b) does not change. Since in general
we start with $z = 1$, we can just skip it and obtain the simpler recursion for
$n = 0, \dots, N-1$

\begin{align*}
V_0(x, \mu) &:= \int_{E_Y} \int_{\mathbb{R}_+} U(s)\, \mu(dy, ds), \\
V_{n+1}(x, \mu) &= \inf_{a \in D(x)} \int_{E_X} V_n\big( x', \Psi(x, a, x', \mu) \big)\, Q^X(dx'|x, \mu^Y, a), \quad (x, \mu) \in E_X \times \mathbb{P}_b(E_Y \times \mathbb{R}_+),
\end{align*}

where $\Psi(x, a, x', \mu) := \Psi(x, a, x', \mu, 1)$ from (3.2). Indeed the $z$-component is
equivalent to the knowledge of the time step, but since we would like to consider a general
problem it makes sense to introduce this component in the model setup in Section 3.

4.4. Exponential Utility function. In this section we assume now that the utility function
has the special form $U(x) = \frac{1}{\gamma} e^{\gamma x}$ with $\gamma \neq 0$. This situation is
often referred to as the usual risk-sensitive problem. Partially observable problems in this
setting have already been considered in [?]. However, still in this case our model is far more
general than in the previous literature where the filter is derived with a change of measure
technique. As we have shown in (3.3) such a measure transformation is not needed for the
computation of the filter.

Our aim is to specialize the value iteration from Theorem 3.3 to this case. In order to do this
define for $\mu \in \mathbb{P}_b(E_Y \times \mathbb{R}_+)$:

\[ \hat{\mu}(B) := \frac{\int_B \int_{\mathbb{R}_+} e^{\gamma s}\, \mu(dy, ds)}{\int_{\mathbb{R}_+} e^{\gamma s}\, \mu^S(ds)}, \quad B \in \mathcal{B}(E_Y), \tag{4.2} \]

which obviously yields a new probability measure $\hat{\mu} \in \mathbb{P}(E_Y)$.


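Remark 4.3. From Theorem 3.2 it follows directly that $\hat{\mu}$ has a certain interpretation. We
obtain for $\mu_n$ from Theorem 3.2 that for $B \in \mathcal{B}(E_Y)$

\[ \int_B \int_{\mathbb{R}_+} e^{\gamma s}\, \mu_n(dy, ds|h_n) = \mathbb{E}^\pi_x\big[ e^{\gamma S_n} 1_B(Y_n) \,\big|\, (X_0, A_0, \dots, X_n) = h_n \big]. \]

If $\hat{\mu}_n$ is the normalized version of this expression then it coincides with the
'information vector' defined e.g. in [?]. Note that we obtain $\hat{\mu}_n$ in a very natural way
as a special case of our general $\mu_n$ in Section 3.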


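Further we can write for $(x, \mu, z) \in E$:

\begin{align*}
V_n(x, \mu, z) &= \inf_{\pi} \frac{1}{\gamma} \int_{E_Y} \int_{\mathbb{R}_+} e^{\gamma s}\, \mathbb{E}^\pi_{xy}\Big[ \exp\Big( \gamma z \sum_{k=0}^{n-1} \beta^k c(X_k, Y_k, A_k) \Big) \Big]\, \mu(dy, ds) \\
&= \int_{\mathbb{R}_+} e^{\gamma s}\, \mu^S(ds) \cdot \inf_{\pi} \frac{1}{\gamma} \int_{E_Y} \mathbb{E}^\pi_{xy}\Big[ \exp\Big( \gamma z \sum_{k=0}^{n-1} \beta^k c(X_k, Y_k, A_k) \Big) \Big]\, \hat{\mu}(dy) \\
&=: \int_{\mathbb{R}_+} e^{\gamma s}\, \mu^S(ds) \cdot e_n(x, \hat{\mu}, \gamma z).
\end{align*}

Using this representation, the value iteration in Theorem 3.3 can be restricted to the functions
$e_n$. The state space $E_X \times \mathbb{P}(E_Y) \times (0,1]$ is much simpler because measures
are only concentrated on $E_Y$.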
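Theorem 4.4. a) For $(x, \mu, z) \in E_X \times \mathbb{P}(E_Y) \times (0,1]$ it holds that
$e_0(x, \mu, \gamma z) = \frac{1}{\gamma}$ and for $n = 1, \dots, N$

\[ e_{n+1}(x, \mu, \gamma z) = \inf_{a \in D(x)} \int_{E_X} e_n\big( x', \Psi_e(x, a, x', \mu, \gamma z), \beta\gamma z \big)\, \hat{Q}^X(dx'|x, \mu, a, \gamma z) \]

where for $B_1 \in \mathcal{B}(E_X)$, $B_2 \in \mathcal{B}(E_Y)$

\[ \hat{Q}^X(B_1|x, \mu, a, \gamma z) := \int_{B_1} \int_{E_Y} e^{\gamma z c(x,y,a)}\, q^X(x'|x, y, a)\, \mu(dy)\, \lambda(dx'), \tag{4.3} \]

\[ \Psi_e(x, a, x', \mu, \gamma z)(B_2) := \frac{\int_{B_2} \int_{E_Y} e^{\gamma z c(x,y,a)}\, q(x', y'|x, y, a)\, \mu(dy)\, \nu(dy')}{\int_{E_Y} \int_{E_Y} e^{\gamma z c(x,y,a)}\, q(x', y'|x, y, a)\, \mu(dy)\, \nu(dy')}. \tag{4.4} \]

The value function of (3.1) is then given by $J_N(x) = e_N(x, Q_0, \gamma)$.
b) For every $n = 1, \dots, N$ there exists a minimizer $f_n^* \in F$ of $e_{n-1}$ and
$(g_0^*, \dots, g_{N-1}^*)$ with

\[ g_n^*(h_n) := f_{N-n}^*\big( x_n, \mu_n^e(\cdot|h_n), \gamma\beta^n \big), \quad n = 0, \dots, N-1, \]

is an optimal policy for problem (3.1), where the sequence $(\mu_n^e)$ of posterior distributions
is generated by the updating operator $\Psi_e$ with $\mu_0^e := Q_0$.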

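Proof. Let $(x,\mu,z) \in E$. On the one hand we have that

\[
V_{n+1}(x,\mu,z) = \int_{\mathbb{R}_+} e^{\gamma s} \mu^S(ds) \cdot e_{n+1}(x,\hat\mu,\gamma z).
\]

On the other hand we have by Theorem 3.3

\begin{align*}
V_{n+1}(x,\mu,z) &= \inf_{a\in D(x)} \int_{E_X} V_n\big(x', \Psi(x,a,x',\mu,z), \beta z\big)\, Q^X(dx'|x,\mu^Y,a) \\
&= \inf_{a\in D(x)} \int_{E_X}\int_{\mathbb{R}_+} e^{\gamma s'}\, \Psi^S(x,a,x',\mu,z)(ds') \cdot e_n\big(x', \hat\Psi(x,a,x',\mu,z), \beta\gamma z\big)\, Q^X(dx'|x,\mu^Y,a) \\
&= \inf_{a\in D(x)} \int_{E_X}\int_{E_Y}\int_{E_Y}\int_{\mathbb{R}_+} e^{\gamma s + \gamma z c(x,y,a)}\, q(x',y'|x,y,a)\, \mu(dy,ds)\, \nu(dy') \cdot e_n\big(x', \hat\Psi(x,a,x',\mu,z), \beta\gamma z\big)\, \lambda(dx') \\
&= \int_{\mathbb{R}_+} e^{\gamma s} \mu^S(ds) \cdot \inf_{a\in D(x)} \int_{E_X}\int_{E_Y}\int_{E_Y} e^{\gamma z c(x,y,a)}\, q(x',y'|x,y,a)\, \hat\mu(dy)\, \nu(dy')\, e_n\big(x', \hat\Psi(x,a,x',\mu,z), \beta\gamma z\big)\, \lambda(dx') \\
&= \int_{\mathbb{R}_+} e^{\gamma s} \mu^S(ds) \cdot \inf_{a\in D(x)} \int_{E_X} e_n\big(x', \hat\Psi(x,a,x',\mu,z), \beta\gamma z\big)\, Q_e^X(dx'|x,\hat\mu,a,\gamma z).
\end{align*}

It remains to show that $\hat\Psi(x,a,x',\mu,z) = \Psi_e(x,a,x',\hat\mu,\gamma z)$ which is defined in (4.4). We obtain
for $B \in \mathcal{B}(E_Y)$:

\begin{align*}
\hat\Psi(x,a,x',\mu,z)(B) &= \frac{\int_B\int_{\mathbb{R}_+} e^{\gamma s'}\, \Psi(x,a,x',\mu,z)(dy',ds')}{\int_{E_Y}\int_{\mathbb{R}_+} e^{\gamma s'}\, \Psi(x,a,x',\mu,z)(dy',ds')} \\
&= \frac{\int_B\int_{E_Y}\int_{\mathbb{R}_+} q(x',y'|x,y,a)\, e^{\gamma s + \gamma z c(x,y,a)}\, \mu(dy,ds)\, \nu(dy')}{\int_{E_Y}\int_{E_Y}\int_{\mathbb{R}_+} q(x',y'|x,y,a)\, e^{\gamma s + \gamma z c(x,y,a)}\, \mu(dy,ds)\, \nu(dy')} \\
&= \frac{\int_B\int_{E_Y} q(x',y'|x,y,a)\, e^{\gamma z c(x,y,a)}\, \hat\mu(dy)\, \nu(dy')}{\int_{E_Y}\int_{E_Y} q(x',y'|x,y,a)\, e^{\gamma z c(x,y,a)}\, \hat\mu(dy)\, \nu(dy')} \\
&= \Psi_e(x,a,x',\hat\mu,\gamma z)(B).
\end{align*}

Hence part a) is shown. Part b) follows as in Theorem 3.3 c). □

Remark 4.5. If $(\mu_n)$ is generated by $\Psi$ with $\mu_0 := Q_0 \otimes \delta_0$ (note that the $\mu_n$ are probability measures
on $\mathcal{B}(E_Y \times \mathbb{R}_+)$), then $\hat\mu_n(\cdot|h_n) = \widehat{\mu_n(\cdot|h_n)}$, i.e., $(\hat\mu_n)$ is the sequence of information vectors (see
Remark 4.3). The statement follows directly from the proof of the previous theorem.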
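On finite state spaces the updating operator (4.4) becomes a weighted Bayes step: each prior atom is multiplied by $e^{z c(x,y,a)}$ before conditioning on the new observation. The sketch below assumes finite $E_X$ and $E_Y$, with $\lambda$ and $\nu$ taken as counting measures; the names `psi_e`, `q`, `cost` and the toy model are hypothetical and serve only as an illustration.

```python
import math

def psi_e(q, cost, mu, x, a, x_next, z):
    """Finite-state sketch of the updating operator (4.4).

    q[(x, y, a)] maps pairs (x', y') to transition probabilities,
    cost(x, y, a) is the one-stage cost, mu maps hidden states y to
    probabilities, and z plays the role of the last argument of Psi_e
    (gamma * z in Theorem 4.4). Assumes x_next has positive weight.
    """
    unnormalized = {}
    for y, p in mu.items():
        weight = math.exp(z * cost(x, y, a)) * p
        for (xn, yn), prob in q[(x, y, a)].items():
            if xn == x_next:
                unnormalized[yn] = unnormalized.get(yn, 0.0) + weight * prob
    total = sum(unnormalized.values())
    return {yn: v / total for yn, v in unnormalized.items()}

# Hypothetical model: static hidden state y in {0, 1}; the observation x' = 1
# occurs with probability 0.2 (y = 0) or 0.8 (y = 1); the cost equals y.
q = {("x0", y, "a"): {(1, y): p1, (0, y): 1.0 - p1}
     for y, p1 in ((0, 0.2), (1, 0.8))}
cost = lambda x, y, a: float(y)
prior = {0: 0.5, 1: 0.5}

bayes = psi_e(q, cost, prior, "x0", "a", 1, z=0.0)
tilted = psi_e(q, cost, prior, "x0", "a", 1, z=math.log(2.0))
```

With `z = 0.0` the update is the ordinary Bayes posterior; a positive `z` shifts additional mass towards the costly hidden state.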


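4.5. Power Utility function. In this section we assume that the utility function has the special
form $U(x) = \frac{1}{\gamma} x^\gamma$ with $\gamma \neq 0$. Thus, we obtain:

\begin{align*}
V_n(x,\mu,z) &= \inf_{\sigma\in\Pi} \frac{1}{\gamma} \int_{E_Y}\int_{\mathbb{R}_+} \mathbb{E}^\sigma_{xy}\Big[\Big(s + z\sum_{k=0}^{n-1} \beta^k c(X_k,Y_k,A_k)\Big)^\gamma\Big]\, \mu(dy,ds) \\
&= z^\gamma \inf_{\sigma\in\Pi} \frac{1}{\gamma} \int_{E_Y}\int_{\mathbb{R}_+} \mathbb{E}^\sigma_{xy}\Big[\Big(\frac{s}{z} + \sum_{k=0}^{n-1} \beta^k c(X_k,Y_k,A_k)\Big)^\gamma\Big]\, \mu(dy,ds) \\
&= z^\gamma \inf_{\sigma\in\Pi} \frac{1}{\gamma} \int_{E_Y}\int_{\mathbb{R}_+} \mathbb{E}^\sigma_{xy}\Big[\Big(s + \sum_{k=0}^{n-1} \beta^k c(X_k,Y_k,A_k)\Big)^\gamma\Big]\, \tilde\mu(dy,ds) \\
&=: z^\gamma d_n(x,\tilde\mu),
\end{align*}

where $\tilde\mu$ is defined by $\tilde\mu(B_1 \times B_2) := \mu(B_1 \times zB_2)$ for $B_1 \times B_2 \in \mathcal{B}(E_Y \times \mathbb{R}_+)$. Hence
$\tilde\mu \in \mathcal{P}_b(E_Y \times \mathbb{R}_+)$.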
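The factorization $V_n = z^\gamma d_n(x,\tilde\mu)$ rests on the homogeneity $(s + zk)^\gamma = z^\gamma (s/z + k)^\gamma$ together with the rescaling $\mu \mapsto \tilde\mu$. A small numerical check for a measure with finitely many atoms might look as follows; the helper names, the atoms and the fixed number `future_cost` (standing in for one realization of the discounted cost sum) are assumptions made purely for illustration.

```python
def rescale(mu, z):
    """Return mu_tilde with mu_tilde(B1 x B2) = mu(B1 x z*B2): atoms s -> s / z."""
    out = {}
    for (y, s), p in mu.items():
        key = (y, s / z)
        out[key] = out.get(key, 0.0) + p
    return out

def power_value(mu, z, future_cost, gamma):
    """(1/gamma) * sum of (s + z*future_cost)^gamma over the atoms of mu."""
    return sum(p * (s + z * future_cost) ** gamma
               for (_, s), p in mu.items()) / gamma

# Hypothetical numbers; future_cost stands in for a fixed value of
# sum_k beta^k c(X_k, Y_k, A_k).
mu = {("y1", 1.0): 0.4, ("y2", 3.0): 0.6}
z, gamma, future_cost = 0.5, 0.7, 2.0

lhs = power_value(mu, z, future_cost, gamma)
rhs = z ** gamma * power_value(rescale(mu, z), 1.0, future_cost, gamma)
```

Both sides agree up to floating-point error, which is the reason the $z$-component can be absorbed into the measure in the power utility case.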

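Theorem 4.6. a) For $(x,\mu) \in E_X \times \mathcal{P}_b(E_Y \times \mathbb{R}_+)$ it holds $d_0(x,\mu) := \frac{1}{\gamma} \int_{E_Y}\int_{\mathbb{R}_+} s^\gamma\, \mu(dy,ds)$
and for $n = 1,\dots,N$

\[
d_{n+1}(x,\mu) = \inf_{a\in D(x)} \beta^\gamma \int_{E_X} d_n\big(x', \Psi_p(x,a,x',\mu)\big)\, Q^X(dx'|x,\mu^Y,a),
\]

where for $B \in \mathcal{B}(E_Y \times \mathbb{R}_+)$

\[
\Psi_p(x,a,x',\mu)(B) := \frac{\int_{E_Y}\int_{\mathbb{R}_+} \Big(\int_B q(x',y'|x,y,a)\, \nu(dy')\, \delta_{\frac{s + c(x,y,a)}{\beta}}(ds')\Big)\, \mu(dy,ds)}{\int_{E_Y} q^X(x'|x,y,a)\, \mu^Y(dy)}.
\]

The value function of (3.1) is then given by $J_N(x) = d_N(x, Q_0 \otimes \delta_0)$.

b) For every $n = 1,\dots,N$ there exists a minimizer $f_n^* \in F$ of $d_{n-1}$ and $(g_0^*,\dots,g_{N-1}^*)$ with

\[
g_n^*(h_n) := f_{N-n}^*\big(x_n, \tilde\mu_n(\cdot|h_n)\big), \qquad n = 0,\dots,N-1,
\]

is an optimal policy for problem (3.1), where the sequence $(\tilde\mu_n)$ is generated by $\Psi_p$ with
$\tilde\mu_0 := Q_0 \otimes \delta_0$.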

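Proof. On the one hand we have shown

\[
V_{n+1}(x,\mu,z) = z^\gamma d_{n+1}(x,\tilde\mu).
\]

On the other hand we obtain with Theorem 3.3

\begin{align*}
V_{n+1}(x,\mu,z) &= \inf_{a\in D(x)} \int_{E_X} V_n\big(x', \Psi(x,a,x',\mu,z), \beta z\big)\, Q^X(dx'|x,\mu^Y,a) \\
&= \inf_{a\in D(x)} \beta^\gamma z^\gamma \int_{E_X} d_n\big(x', \tilde\Psi(x,a,x',\mu,z)\big)\, Q^X(dx'|x,\mu^Y,a),
\end{align*}

where $\tilde\Psi(x,a,x',\mu,z)$ denotes the measure $\Psi(x,a,x',\mu,z)$ with its cost component rescaled by $\beta z$.
It remains to show that $\tilde\Psi(x,a,x',\mu,z) = \Psi_p(x,a,x',\tilde\mu)$.
Here we obtain for $B \in \mathcal{B}(E_Y \times \mathbb{R}_+)$:

\begin{align*}
\tilde\Psi(x,a,x',\mu,z)(B) &= \frac{\int_{E_Y}\int_{\mathbb{R}_+} \Big(\int_B q(x',y'|x,y,a)\, \nu(dy')\, \delta_{\frac{s + z c(x,y,a)}{\beta z}}(ds')\Big)\, \mu(dy,ds)}{\int_{E_Y}\int_{\mathbb{R}_+}\int_{\mathbb{R}_+}\int_{E_Y} q(x',y'|x,y,a)\, \nu(dy')\, \delta_{\frac{s + z c(x,y,a)}{\beta z}}(ds')\, \mu(dy,ds)} \\
&= \frac{\int_{E_Y}\int_{\mathbb{R}_+} \Big(\int_B q(x',y'|x,y,a)\, \nu(dy')\, \delta_{\frac{s + c(x,y,a)}{\beta}}(ds')\Big)\, \tilde\mu(dy,ds)}{\int_{E_Y}\int_{\mathbb{R}_+}\int_{\mathbb{R}_+}\int_{E_Y} q(x',y'|x,y,a)\, \nu(dy')\, \delta_{\frac{s + c(x,y,a)}{\beta}}(ds')\, \tilde\mu(dy,ds)} \\
&= \Psi_p(x,a,x',\tilde\mu)(B).
\end{align*}

Hence part a) is shown. Part b) follows as in Theorem 3.3 c). □

Remark 4.7. If $(\mu_n)$ is generated by $\Psi$ with $\mu_0 := Q_0 \otimes \delta_0$ then $\tilde\mu_n(\cdot|h_n) = \widetilde{\mu_n(\cdot|h_n)}$. The
statement follows directly from the proof of the previous theorem.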


Remark 4.8. Note that the special case $U(x) = \log(x)$ can be treated similarly. It can also be
obtained from the power utility case by letting $\gamma \to 0$.

Remark 4.9. Also the updating operators $\Psi_e$ and $\Psi_p$ simplify considerably if the cost function
$c(x,y,a)$ is independent of $y$ (see Section 4.1).


5. Application: Risk-Sensitive Bayesian House Selling Problem 

As an application we consider a risk-sensitive Bayesian extension of the classical house selling
problem with finite time horizon. We assume that offers for a house $X_0,\dots,X_N$ arrive
independently and are identically distributed with distribution $Q_\theta$. Here $\theta \in \Theta$ is an unknown
parameter and $\Theta$ is assumed to be a Borel space. Further we assume that $Q_\theta$ has a $\lambda$-density
$q(x|\theta)$ which is continuous in both parameters with compact support. A prior distribution $Q_0$
for $\theta$ is given. As long as offers are rejected an observation cost of $c_\theta > 0$ has to be paid which
also depends on $\theta$ and cannot be observed. We suppose that $c_\theta$ is continuous in $\theta$. When an
offer is accepted, the price is obtained and the process ends. If one has not stopped before $N$,
the last offer has to be accepted. The aim is to find the maximal risk-sensitive stopping reward

\[
J_N(x) = \sup_{0 \le \tau \le N} \int_\Theta \mathbb{E}_{x\theta}\big[U(X_\tau - c_\theta \tau)\big]\, Q_0(d\theta) \tag{5.1}
\]


where the supremum is taken over all stopping times $\tau$. Here we assume that $U : \mathbb{R} \to \mathbb{R}$ is
strictly increasing and concave. In order to have a well-defined problem we also assume that
sup^) E 0 [Aj''] < 00 . This risk-sensitive Bayesian house selling problem can be solved in a similar 
way as our general model with = 0 and Ey = 0, i.e., the unobservable component is simply 
the unknown parameter and $c(x,\theta) = c_\theta$ (independent of $x$). However note that we also have a
terminal reward in case we have not stopped before which equals the last offer. Risk-sensitive
house selling problems with complete observation have been treated in [16]. A risk-sensitive
Bayesian house selling problem has been considered in [4], however with fixed observation costs
$c$ (independent of $\theta$). We define the updating operator $\Psi$ for the joint conditional probability
of the unknown parameter $\theta$ and the accumulated cost so far only in case we do not stop,
because otherwise the problem ends immediately. Also note that since $\beta = 1$ we can skip the
$z$-component in the state space. Moreover, the i.i.d. assumption on the offers implies that $\Psi$
does not depend on $x$ which is the previous offer. The updating operator is given by

\[
\Psi(x',\mu)(B_1 \times B_2) := \frac{\int_{B_1}\int_{\mathbb{R}_-} q(x'|\theta)\, \delta_{s - c_\theta}(B_2)\, \mu(d\theta,ds)}{\int_\Theta q(x'|\theta)\, \mu^\Theta(d\theta)}, \qquad B_1 \times B_2 \in \mathcal{B}(\Theta \times \mathbb{R}_-).
\]

According to Theorem 3.3 we obtain $J_N$ by computing the functions $V_n$. These are given by

\begin{align*}
V_0(x,\mu) &= \int U(x + s)\, \mu^S(ds) =: U_\mu(x), \\
V_n(x,\mu) &= \max\big\{U_\mu(x),\, d_n(\mu)\big\}
\end{align*}

with $d_n(\mu) := \int_{\mathbb{R}} V_{n-1}\big(x', \Psi(x',\mu)\big)\, Q^X(dx'|\mu^\Theta)$. We have that $J_N(x) = V_N(x, Q_0 \otimes \delta_0)$. Note
that $Q^X(dx'|\mu^\Theta)$ is given by

\[
Q^X(B|\mu^\Theta) = \int_B\int_\Theta q(x'|\theta)\, \mu^\Theta(d\theta)\, \lambda(dx'), \qquad B \in \mathcal{B}(E_X).
\]

When we define $f_n^*(x,\mu) = \text{stop}$ if $U_\mu(x) \ge d_n(\mu)$ and $(g_0^*,\dots,g_{N-1}^*)$ by

\[
g_n^*(h_n) := f_{N-n}^*\big(x_n, \mu_n(\cdot|h_n)\big), \qquad h_n = (x_0,x_1,\dots,x_n), \quad n = 0,\dots,N-1,
\]

then the optimal stopping time for problem (5.1) is given by

\[
\tau^* := \inf\{n \in \mathbb{N}_0 : g_n^*(h_n) = \text{stop}\} \wedge N.
\]

Let us now further investigate the optimal stopping time $\tau^*$. As in Section we define by
$\mu_n(\cdot|h_n)$ the sequence of conditional probabilities generated by the updating operator. Then we
have

\[
g_n^*(h_n) = \text{stop} \iff U_{\mu_n(\cdot|h_n)}(x_n) \ge d_{N-n}\big(\mu_n(\cdot|h_n)\big).
\]

Since $x \mapsto U_\mu(x)$ is increasing and continuous, the inverse function $U_\mu^{-1}$ exists and we obtain

\[
g_n^*(h_n) = \text{stop} \iff x_n \ge U^{-1}_{\mu_n(\cdot|h_n)}\Big(d_{N-n}\big(\mu_n(\cdot|h_n)\big)\Big) =: x^*_{n,N}\big(\mu_n(\cdot|h_n)\big).
\]

We call $x^*_{n,N}(\cdot)$ reservation level. The reservation levels depend on $\mu_n$ and $U$. The optimal
stopping time is hence the first time the offer exceeds the corresponding history-dependent
reservation level.


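Theorem 5.1. a) The optimal stopping time for the risk-sensitive Bayesian house selling
problem is given by

\[
\tau^* = \inf\big\{n \in \mathbb{N}_0 : X_n \ge x^*_{n,N}\big(\mu_n(\cdot|h_n)\big)\big\} \wedge N.
\]

b) The reservation levels can recursively be computed by

\begin{align*}
x^*_{N-1,N}(\mu_{N-1}) &= U^{-1}_{\mu_{N-1}} \circ \int_{\mathbb{R}}\int_{\mathbb{R}_-} U(x + s)\, \Psi(x,\mu_{N-1})^S(ds)\, Q^X(dx|\mu^\Theta_{N-1}), \\
x^*_{n,N}(\mu_n) &= U^{-1}_{\mu_n} \circ \int_{\mathbb{R}} U_{\Psi(x,\mu_n)}\Big(\max\big\{x,\, x^*_{n+1,N}\big(\Psi(x,\mu_n)\big)\big\}\Big)\, Q^X(dx|\mu_n^\Theta).
\end{align*}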

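Proof. Part a) is clear from the definition and the previous results. Part b) can be shown by
inserting the correct definitions. For $n = N-1$ we obtain from the definition of $x^*_{N-1,N}(\mu)$ that

\[
x^*_{N-1,N}(\mu) = U_\mu^{-1}\big(d_1(\mu)\big)
\]

with

\[
d_1(\mu) = \int_{\mathbb{R}} V_0\big(x, \Psi(x,\mu)\big)\, Q^X(dx|\mu^\Theta).
\]

For $x^*_{n,N}$ we obtain by definition:

\[
x^*_{n,N}(\mu) = U_\mu^{-1}\big(d_{N-n}(\mu)\big).
\]

Further $d_{N-n}(\mu_n)$ can be written as

\begin{align*}
d_{N-n}(\mu_n) &= \int_{\mathbb{R}} V_{N-n-1}\big(x, \Psi(x,\mu_n)\big)\, Q^X(dx|\mu_n^\Theta) \\
&= \int_{\mathbb{R}} \max\Big\{U_{\Psi(x,\mu_n)}(x),\, d_{N-n-1}\big(\Psi(x,\mu_n)\big)\Big\}\, Q^X(dx|\mu_n^\Theta) \\
&= \int_{\mathbb{R}} U_{\Psi(x,\mu_n)}\Big(\max\Big\{x,\, U^{-1}_{\Psi(x,\mu_n)} \circ d_{N-n-1}\big(\Psi(x,\mu_n)\big)\Big\}\Big)\, Q^X(dx|\mu_n^\Theta)
\end{align*}

and the statement follows from the definition of $x^*_{n,N}$. □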
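The last-step reservation level $x^*_{N-1,N}(\mu) = U_\mu^{-1}(d_1(\mu))$ can be evaluated numerically once offers and parameters live on finite sets. The following Python sketch does this under several simplifying assumptions that are not in the paper: offers take values on a finite grid (so `q` is a probability mass function rather than a $\lambda$-density), $\mu$ has finitely many atoms $(\theta, s)$, and $U_\mu$ is inverted by bisection on a fixed bracket. All function names and numbers are hypothetical.

```python
import math

def update(mu, q, cost, x_new):
    """Posterior over (theta, s) after rejecting offer x_new: s -> s - c_theta."""
    unnorm = {}
    for (theta, s), p in mu.items():
        key = (theta, s - cost[theta])
        unnorm[key] = unnorm.get(key, 0.0) + q(x_new, theta) * p
    total = sum(unnorm.values())
    return {k: v / total for k, v in unnorm.items()}

def u_mu(U, mu, x):
    """U_mu(x) = sum over atoms of U(x + s) * mu^S(s)."""
    return sum(p * U(x + s) for (_, s), p in mu.items())

def last_reservation_level(U, mu, q, cost, offers, lo=-100.0, hi=100.0, tol=1e-10):
    """x*_{N-1,N}(mu) = U_mu^{-1}(d_1(mu)) for offers on a finite grid."""
    d1 = 0.0
    for x in offers:
        px = sum(p * q(x, theta) for (theta, _), p in mu.items())  # predictive pmf
        if px > 0.0:
            d1 += px * u_mu(U, update(mu, q, cost, x), x)
    while hi - lo > tol:  # bisection: x -> U_mu(x) is increasing
        mid = 0.5 * (lo + hi)
        if u_mu(U, mu, mid) < d1:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical example: known parameter, offers uniform on {1, 2, 3}, cost 0.5.
offers = (1.0, 2.0, 3.0)
q = lambda x, theta: 1.0 / 3.0
cost = {"a": 0.5}
mu0 = {("a", 0.0): 1.0}

neutral = last_reservation_level(lambda x: x, mu0, q, cost, offers)
averse = last_reservation_level(lambda x: -math.exp(-x), mu0, q, cost, offers)
```

In this toy setting the risk-neutral level equals the mean offer minus the observation cost, while the concave utility $U(x) = -e^{-x}$ produces a lower level, so the risk-averse seller accepts earlier.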


6. Infinite Horizon Problems

Here we consider an infinite time horizon and $\beta \in (0,1)$, i.e., we are interested in

\[
J_\infty(x) := \inf_{\sigma\in\Pi} \int_{E_Y} \mathbb{E}^\sigma_{xy}\Big[U\Big(\sum_{k=0}^\infty \beta^k c(X_k,Y_k,A_k)\Big)\Big]\, Q_0(dy), \qquad x \in E_X. \tag{6.1}
\]

We will consider concave and convex utility functions separately.


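6.1. Concave Utility Function. We first investigate the case of a concave utility function
$U : \mathbb{R}_+ \to \mathbb{R}$. This situation represents a risk seeking decision maker.

In this subsection we use the following notations

\begin{align*}
V_{\infty\sigma}(x,\mu,z) &:= \int_{E_Y}\int_{\mathbb{R}_+} \mathbb{E}^\sigma_{xy}\Big[U\Big(s + z\sum_{k=0}^\infty \beta^k c(X_k,Y_k,A_k)\Big)\Big]\, \mu(ds,dy), \\
V_\infty(x,\mu,z) &:= \inf_{\sigma\in\Pi} V_{\infty\sigma}(x,\mu,z), \qquad (x,\mu,z) \in E. \tag{6.2}
\end{align*}

We are interested in obtaining $V_\infty(x, Q_0 \otimes \delta_0, 1) = J_\infty(x)$. For a stationary policy $\pi = (f,f,\dots) \in \Pi^M$
we write $V_{\infty\pi} = V_f$ and denote

\begin{align*}
\underline{b}(\mu,z) &:= \int_{\mathbb{R}_+} U\Big(s + \frac{z\underline{c}}{1-\beta}\Big)\, \mu^S(ds), \\
\bar{b}(\mu,z) &:= \int_{\mathbb{R}_+} U\Big(s + \frac{z\bar{c}}{1-\beta}\Big)\, \mu^S(ds), \qquad (\mu,z) \in \mathcal{P}_b(E_Y \times \mathbb{R}_+) \times [0,1].
\end{align*}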


Then we obtain the main theorem of this section:

Theorem 6.1. The following statements hold true:

a) $V_\infty$ is the unique solution of $v = Tv$ in $C(E)$ with $\underline{b}(\mu,z) \le v(x,\mu,z) \le \bar{b}(\mu,z)$ for $T$
defined in (3.7). Moreover, $T^n V_0 \uparrow V_\infty$, $T^n \underline{b} \uparrow V_\infty$ and $T^n \bar{b} \downarrow V_\infty$ for $n \to \infty$. The value
function of (6.1) is given by $J_\infty(x) = V_\infty(x, Q_0 \otimes \delta_0, 1)$.

b) There exists a minimizer $f^*$ of $V_\infty$ and $(g_0^*, g_1^*, \dots)$ with

\[
g_n^*(h_n) := f^*\big(x_n, \mu_n(\cdot|h_n), \beta^n\big)
\]

is an optimal policy for (6.1).


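Proof. a) We first show that $V_n = T^n V_0 \uparrow V_\infty$ for $n \to \infty$. To this end note that for
$U : \mathbb{R}_+ \to \mathbb{R}$ increasing and concave we obtain the inequality

\[
U(s_1 + s_2) \le U(s_1) + U'_-(s_1)\, s_2, \qquad s_1, s_2 \ge 0,
\]

where $U'_-$ is the left-hand side derivative of $U$ which exists since $U$ is concave. Moreover,
$U'_-(s) \ge 0$ and $U'_-$ is non-increasing. For $(x,\mu,z) \in E$ and $\sigma \in \Pi$ it holds

\begin{align*}
V_n(x,\mu,z) &\le V_{n\sigma}(x,\mu,z) \le V_{\infty\sigma}(x,\mu,z) \\
&= \int_{E_Y}\int_{\mathbb{R}_+} \mathbb{E}^\sigma_{xy}\Big[U\Big(s + z\sum_{k=0}^\infty \beta^k c(X_k,Y_k,A_k)\Big)\Big]\, \mu(dy,ds) \\
&\le \int_{E_Y}\int_{\mathbb{R}_+} \mathbb{E}^\sigma_{xy}\Big[U\Big(s + z\sum_{k=0}^{n-1} \beta^k c(X_k,Y_k,A_k)\Big)\Big]\, \mu(dy,ds) \\
&\quad + \int_{E_Y}\int_{\mathbb{R}_+} \mathbb{E}^\sigma_{xy}\Big[U'_-\Big(s + z\sum_{k=0}^{n-1} \beta^k c(X_k,Y_k,A_k)\Big)\, z \sum_{m=n}^\infty \beta^m c(X_m,Y_m,A_m)\Big]\, \mu(dy,ds) \\
&\le V_{n\sigma}(x,\mu,z) + \beta^n \frac{z\bar{c}}{1-\beta} \int_{\mathbb{R}_+} U'_-(s + z\underline{c})\, \mu^S(ds) \\
&\le V_{n\sigma}(x,\mu,z) + \beta^n \frac{z\bar{c}}{1-\beta}\, U'_-(z\underline{c}) =: V_{n\sigma}(x,\mu,z) + \varepsilon_n(z), \tag{6.3}
\end{align*}

where $\varepsilon_n(z)$ has implicitly been defined in the last equation.

Obviously $\lim_{n\to\infty} \varepsilon_n(z) = 0$. Taking the infimum over all policies in the preceding
inequality yields:

\[
V_n(x,\mu,z) \le V_\infty(x,\mu,z) \le V_n(x,\mu,z) + \varepsilon_n(z).
\]

Letting $n \to \infty$ yields $V_n = T^n V_0 \uparrow V_\infty$ for $n \to \infty$. Note that the convergence of $T^n V_0$
is monotone (see Remark 3.5).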


By direct inspection we obtain $\underline{b} \le V_\infty \le \bar{b}$. We next show that $V_\infty = TV_\infty$. Note
that $V_n \le V_\infty$ for all $n$. Since $T$ is increasing we have $V_{n+1} = TV_n \le TV_\infty$ for all $n$.
Letting $n \to \infty$ implies $V_\infty \le TV_\infty$. For the reverse inequality recall that $V_n + \varepsilon_n \ge V_\infty$
from (6.3). Applying the $T$-operator yields $V_{n+1} + \varepsilon_{n+1} = T(V_n + \varepsilon_n) \ge TV_\infty$ and letting
$n \to \infty$ we obtain $V_\infty \ge TV_\infty$. Hence it follows $V_\infty = TV_\infty$.
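Next, we obtain

\begin{align*}
(T\bar{b})(x,\mu,z) &= \inf_{a\in D(x)} \int_{E_X}\int_{\mathbb{R}_+} U\Big(s' + \frac{z\beta\bar{c}}{1-\beta}\Big)\, \Psi^S(x,a,x',\mu,z)(ds')\, Q^X(dx'|x,\mu^Y,a) \\
&\le \int_{\mathbb{R}_+} U\Big(s + z\bar{c} + \frac{z\beta\bar{c}}{1-\beta}\Big)\, \mu^S(ds) \\
&= \int_{\mathbb{R}_+} U\Big(s + \frac{z\bar{c}}{1-\beta}\Big)\, \mu^S(ds) = \bar{b}(\mu,z).
\end{align*}

Analogously $T\underline{b} \ge \underline{b}$. Thus we get that $T^n\bar{b} \downarrow$ and $T^n\underline{b} \uparrow$ and the limits exist. Moreover,
we obtain by iteration:

\begin{align*}
(T^n\underline{b})(x,\mu,z) &= \inf_{\pi\in\Pi^M} \int_{E_Y}\int_{\mathbb{R}_+} \mathbb{E}^\pi_{xy}\Big[U\Big(s + \frac{z\underline{c}\beta^n}{1-\beta} + z\sum_{k=0}^{n-1} \beta^k c(X_k,Y_k,A_k)\Big)\Big]\, \mu(dy,ds) \ge (T^n V_0)(x,\mu,z), \\
(T^n\bar{b})(x,\mu,z) &= \inf_{\pi\in\Pi^M} \int_{E_Y}\int_{\mathbb{R}_+} \mathbb{E}^\pi_{xy}\Big[U\Big(s + \frac{z\bar{c}\beta^n}{1-\beta} + z\sum_{k=0}^{n-1} \beta^k c(X_k,Y_k,A_k)\Big)\Big]\, \mu(dy,ds).
\end{align*}

Using $U(s_1 + s_2) - U(s_1) \le U'_-(s_1)\, s_2$ we obtain:

\begin{align*}
0 &\le (T^n\bar{b})(x,\mu,z) - (T^n\underline{b})(x,\mu,z) \le (T^n\bar{b})(x,\mu,z) - (T^n V_0)(x,\mu,z) \\
&\le \sup_{\pi\in\Pi} \int_{E_Y}\int_{\mathbb{R}_+} \mathbb{E}^\pi_{xy}\Big[U\Big(s + \frac{z\bar{c}\beta^n}{1-\beta} + z\sum_{k=0}^{n-1} \beta^k c(X_k,Y_k,A_k)\Big) - U\Big(s + z\sum_{k=0}^{n-1} \beta^k c(X_k,Y_k,A_k)\Big)\Big]\, \mu(dy,ds) \\
&\le \varepsilon_n(z)
\end{align*}

and the right-hand side converges to zero for $n \to \infty$. As a result $T^n\bar{b} \downarrow V_\infty$ and $T^n\underline{b} \uparrow V_\infty$
for $n \to \infty$.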

Since $V_n$ is lower semicontinuous, this yields immediately that $V_\infty$ is again lower
semicontinuous, thus $V_\infty \in C(E)$.

For the uniqueness suppose that $v \in C(E)$ is another solution of $v = Tv$ with
$\underline{b} \le v \le \bar{b}$. Then $T^n\underline{b} \le v \le T^n\bar{b}$ for all $n \in \mathbb{N}$ and since the limits for $n \to \infty$ of the
right and left-hand side are equal to $V_\infty$ the statement follows.
b) The existence of a minimizer follows from our standing assumption (A) as in the proof
of Theorem 3.3. From our assumption and the fact that $V_\infty \ge V_0$ we obtain

\[
V_\infty = \lim_{n\to\infty} T^n_{f^*} V_\infty \ge \lim_{n\to\infty} T^n_{f^*} V_0 = \lim_{n\to\infty} V_{n(f^*,f^*,\dots)} = V_{f^*} \ge V_\infty
\]

where the last equation follows with dominated convergence. Hence $(g_0^*, g_1^*, \dots)$ is optimal
for (6.1).

□
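The sandwich $T^n\underline{b} \le V_\infty \le T^n\bar{b}$ and the geometric closing of the gap can be observed in a deliberately degenerate toy model. The sketch below assumes a single action and a constant deterministic cost `c`, so that $\mu$ is a point mass at $s$ and the operator reduces to $(Tv)(s,z) = v(s + zc, \beta z)$; the utility $U(x) = \sqrt{x}$ and all numbers are illustrative assumptions, not taken from the paper.

```python
import math

def U(x):
    """Illustrative concave, increasing utility on [0, inf)."""
    return math.sqrt(x)

beta, c, c_lo, c_hi = 0.9, 1.0, 0.5, 2.0  # hypothetical discount and cost bounds

def b_lower(s, z):
    return U(s + z * c_lo / (1.0 - beta))

def b_upper(s, z):
    return U(s + z * c_hi / (1.0 - beta))

def T(v):
    """One action, constant cost c: s -> s + z*c and z -> beta*z."""
    return lambda s, z: v(s + z * c, beta * z)

def iterate(v, n):
    """Apply the operator T to v exactly n times."""
    for _ in range(n):
        v = T(v)
    return v

v_inf = U(c / (1.0 - beta))  # infinite-horizon value at s = 0, z = 1
gaps = []
for n in (0, 10, 50):
    lower = iterate(b_lower, n)(0.0, 1.0)
    upper = iterate(b_upper, n)(0.0, 1.0)
    gaps.append(upper - lower)
```

In this toy model the entries of `gaps` shrink as $n$ grows, in line with the bound by $\varepsilon_n(z)$, which carries the factor $\beta^n$.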

6.2. Convex Utility Function. Here we consider the problem with convex utility $U$. This
situation represents a risk averse decision maker. The value functions $V_{n\sigma}$, $V_n$, $V_{\infty\sigma}$, $V_\infty$ are
defined as in the previous section.


Theorem 6.2. Theorem 6.1 also holds for convex U. 


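Proof. The proof follows along the same lines as in Theorem 6.1. The only difference is that we
have to use another inequality: Note that for $U : \mathbb{R}_+ \to \mathbb{R}$ increasing and convex we obtain the
inequality

\[
U(s_1 + s_2) \le U(s_1) + U'_+(s_1 + s_2)\, s_2, \qquad s_1, s_2 \ge 0,
\]

where $U'_+$ is the right-hand side derivative of $U$ which exists since $U$ is convex. Moreover,
$U'_+(s) \ge 0$ and $U'_+$ is increasing. Thus, we obtain for $(x,\mu,z) \in E$ and $\sigma \in \Pi$:

\begin{align*}
V_n(x,\mu,z) &\le V_{n\sigma}(x,\mu,z) \le V_{\infty\sigma}(x,\mu,z) \\
&= \int_{E_Y}\int_{\mathbb{R}_+} \mathbb{E}^\sigma_{xy}\Big[U\Big(s + z\sum_{k=0}^\infty \beta^k c(X_k,Y_k,A_k)\Big)\Big]\, \mu(dy,ds) \\
&\le \int_{E_Y}\int_{\mathbb{R}_+} \mathbb{E}^\sigma_{xy}\Big[U\Big(s + z\sum_{k=0}^{n-1} \beta^k c(X_k,Y_k,A_k)\Big)\Big]\, \mu(dy,ds) \\
&\quad + \int_{E_Y}\int_{\mathbb{R}_+} \mathbb{E}^\sigma_{xy}\Big[U'_+\Big(s + z\sum_{k=0}^\infty \beta^k c(X_k,Y_k,A_k)\Big)\, z\sum_{k=n}^\infty \beta^k c(X_k,Y_k,A_k)\Big]\, \mu(dy,ds) \\
&\le V_{n\sigma}(x,\mu,z) + \frac{z\bar{c}\beta^n}{1-\beta} \int_{\mathbb{R}_+} U'_+\Big(s + \frac{z\bar{c}}{1-\beta}\Big)\, \mu^S(ds).
\end{align*}

Note that the last inequality follows from the fact that $c$ is bounded from above by $\bar{c}$. Now
denote $\delta_n(\mu,z) := \frac{z\bar{c}\beta^n}{1-\beta} \int_{\mathbb{R}_+} U'_+\big(s + \frac{z\bar{c}}{1-\beta}\big)\, \mu^S(ds)$. Obviously $\lim_{n\to\infty} \delta_n(\mu,z) = 0$. Taking the
infimum over all policies in the above inequality yields:

\[
V_n(x,\mu,z) \le V_\infty(x,\mu,z) \le V_n(x,\mu,z) + \delta_n(\mu,z).
\]

Letting $n \to \infty$ yields $T^n V_0 \to V_\infty$.
Further we have to use the inequality

\begin{align*}
0 &\le (T^n\bar{b})(x,\mu,z) - (T^n\underline{b})(x,\mu,z) \le (T^n\bar{b})(x,\mu,z) - (T^n V_0)(x,\mu,z) \\
&\le \sup_{\pi\in\Pi} \int_{E_Y}\int_{\mathbb{R}_+} \mathbb{E}^\pi_{xy}\Big[U\Big(s + \frac{z\bar{c}\beta^n}{1-\beta} + z\sum_{k=0}^{n-1} \beta^k c(X_k,Y_k,A_k)\Big) - U\Big(s + z\sum_{k=0}^{n-1} \beta^k c(X_k,Y_k,A_k)\Big)\Big]\, \mu(dy,ds) \\
&\le \frac{z\bar{c}\beta^n}{1-\beta} \int_{\mathbb{R}_+} U'_+\Big(s + \frac{z\bar{c}}{1-\beta}\Big)\, \mu^S(ds)
\end{align*}

and the right-hand side converges to zero for $n \to \infty$. □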


6.3. Exponential Utility. Of course the result for the infinite horizon problem can now be
specialized to various situations like in Section 4. This can be done in a rather straightforward way. We
only present the case of the exponential utility due to its importance.

Corollary 6.3. In case $U(x) = \frac{1}{\gamma} e^{\gamma x}$ with $\gamma \neq 0$, we obtain

a) $V_\infty(x,\mu,z) = \int_{\mathbb{R}_+} e^{\gamma s} \mu^S(ds) \cdot e_\infty(x,\hat\mu,\gamma z)$, $(x,\mu,z) \in E_X \times \mathcal{P}(E_Y \times \mathbb{R}_+) \times (0,1]$, where $\hat\mu$
has been defined in (4.2) and the function $e_\infty$ is the unique fixed point of

\[
e_\infty(x,\mu,\gamma z) = \inf_{a\in D(x)} \int_{E_X} e_\infty\big(x', \Psi_e(x,a,x',\mu,\gamma z), \beta\gamma z\big)\, Q_e^X(dx'|x,\mu,a,\gamma z),
\]

for $(x,\mu,z) \in E_X \times \mathcal{P}(E_Y) \times (0,1]$ with $U\big(\frac{z\underline{c}}{1-\beta}\big) \le e_\infty(x,\mu,\gamma z) \le U\big(\frac{z\bar{c}}{1-\beta}\big)$. The value
function of (6.1) is then given by $J_\infty(x) = e_\infty(x,Q_0,\gamma)$.
b) There exists a minimizer $f^*$ of $e_\infty$ and $(g_0^*, g_1^*, \dots)$ with

\[
g_n^*(h_n) := f^*\big(x_n, \hat\mu_n(\cdot|h_n), \gamma\beta^n\big)
\]

is an optimal policy for (6.1), where the sequence $(\hat\mu_n)$ of posterior distributions is
generated by the updating operator $\Psi_e$ with $\hat\mu_0 := Q_0$ like in Theorem 4.4.


Acknowledgements: The authors would like to thank three referees for helpful comments
and suggestions which improved the presentation of the paper.